40923240 cp2020 webside


17. Decode a web page

Use the BeautifulSoup and requests Python packages to print a list of all the article titles on the New York Times homepage.

Discussion

Concepts for this week:

  • Libraries
  • requests
  • BeautifulSoup

Libraries

Many people have written Python libraries that do not come with the standard distribution of Python (unlike the random library mentioned in a previous post, which does). These libraries can do anything from machine learning to date and time formatting to meme generation. If you have a task you need done, most likely someone has written a library for it.

There are three main things to keep in mind when using a library:

1. You need to install it. Installation on GNU/Linux-based systems is generally easier than on Windows or OS X, but the library's documentation will usually explain how to do it on each.

2. You need to import it. At the top of your program, make sure you write the line import requests, or whatever the name of your library is. Then you can use it to your heart’s content.

3. You need to read the documentation. Someone else wrote the library, so the rules might not be obvious. Most maintained Python packages come with documentation written by their authors. Eventually, reading documentation will become second nature.
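The first two points can be checked from Python itself. The following is a minimal sketch and not part of the original exercise: the helper name `is_installed` is hypothetical, and it relies on the standard-library function `importlib.util.find_spec`, which returns `None` when a top-level module cannot be found.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# The exercise needs two third-party packages; "bs4" is the import name
# of BeautifulSoup.
for name in ("requests", "bs4"):
    if is_installed(name):
        print(name, "is installed")
    else:
        print(name, "is missing; it can usually be installed with pip")
```

If a package is reported missing, running `pip install requests beautifulsoup4` from a terminal is the usual way to install both.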

Requests

One of the most useful libraries written for Python in recent years, requests does “HTTP for humans.” In layman’s terms, it lets a Python program ask the internet for things. When you type “facebook.com” into the browser, you are asking the internet to show you Facebook’s homepage.

In the same way, a program can ask the internet something. It might not be “show me Facebook”, but you can, for example, ask GitHub for a list of all the repositories that the user “mprat” has. You can do this with an API (Application Programming Interface). This exercise doesn’t use APIs, so we’ll talk more about those in a later post.

Back to showing the user a webpage. When I type “facebook.com” into the browser, Facebook sends my browser a bunch of HTML (basically, code for how the website looks). The browser then takes this HTML and shows it to me in a pretty way.

(Fun fact: to see the HTML of any page in a browser, right-click on the page and choose “Inspect Element” or “View Source”, depending on your browser. In Chrome, “Inspect Element” opens a panel at the bottom of the window where you can see the page’s HTML. This trick will come in handy when you’re doing the exercise. If you need to DO anything with this HTML, it is better to use a program. More posts about this are coming later.)

If I want to “see” a webpage with a program, all I need to do is ask the site for its HTML and read it.

The requests library does half of that job: it asks (requests, if you will) a server for information. This could be just data (through an API; more on that later) or, in the case of this exercise, HTML.

Look at the requests documentation for all the details you need. With a current version of the library, all you need to do to ask a website for its HTML is:

import requests

url = 'http://github.com'
r = requests.get(url)  # send an HTTP GET request to the URL
r_html = r.text        # the response body (here, HTML) as a string

Now the variable r_html holds the HTML of the page as a string. Reading it (otherwise called parsing) happens with a different Python package.
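The request shown above assumes the server answers successfully. As a hedged sketch of a slightly more defensive version, the hypothetical helper `fetch_html` below adds a timeout and calls `raise_for_status()`, which requests provides to turn error responses into exceptions. The default of 10 seconds is an arbitrary choice for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    Raises requests.HTTPError if the server answers with an error status.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx answers into exceptions
    return response.text
```

Calling `fetch_html('http://github.com')` would then return the same kind of string as r_html, or raise an exception instead of silently returning an error page.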

BeautifulSoup

To solve our problem of parsing (reading, understanding, interpreting) the string of HTML we got from requests, we use the BeautifulSoup library.

What it does is give a hierarchical, tree-like structure to the HTML in the document. If you don’t know anything about HTML, the Wikipedia article on HTML is a good summary. For the purposes of this exercise, you don’t need to know anything about HTML beyond being able to look at it quickly.

Because BeautifulSoup takes care of interpreting our HTML for us, we can ask it things like: “give me all the elements with <p> tags” or “find me the parent element of the <title> element”, etc.
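The two questions quoted above can be sketched as follows. The HTML string is made up for illustration so that the example runs without a network connection, and the parser name "html.parser" selects the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# A small, made-up page so the example does not need a network connection.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "Give me all the <p> tags": find_all returns a list of matching elements.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']

# "Find me the parent element of the <title> element".
title_parent = soup.title.parent.name
print(title_parent)  # head
```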

Your code would look something like this:

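from bs4 import BeautifulSoup

# some requests code here for getting r_html

# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(r_html, 'html.parser')
# first <span> with class "articletitle"; find() returns None if there is none
title = soup.find('span', 'articletitle').string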
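The snippet above returns only the first match, and because find returns None when nothing matches, the trailing .string would fail on a page without such an element. A minimal sketch that collects every match instead is shown below. The tag and class names are taken from the snippet above, while the sample HTML and the helper name `extract_titles` are hypothetical. The real New York Times homepage uses different markup that changes over time, so inspect it in the browser before choosing a tag and class.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it does not mirror any real site.
sample_html = """
<div>
  <span class="articletitle">First headline</span>
  <span class="byline">By A. Writer</span>
  <span class="articletitle">  Second headline </span>
</div>
"""


def extract_titles(html, tag="span", css_class="articletitle"):
    """Return the stripped text of every matching element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(tag, class_=css_class)
    return [m.get_text(strip=True) for m in matches]


titles = extract_titles(sample_html)
for title in titles:
    print(title)
```

Because find_all returns an empty list when nothing matches, the function returns an empty list rather than raising an error on a page with no such elements.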

You can do many more things in BeautifulSoup, but I will leave you to explore those by yourself or through later exercises.

Happy coding!


